Harnessing the CRF Complexity with Domain-Specific Constraints. The Case of Morphosyntactic Tagging of a Highly Inflected Language
Author
Abstract
We describe a domain-specific method of adapting conditional random fields (CRFs) to morphosyntactic tagging of highly inflected languages. The solution extends CRFs with additional, position-wise restrictions on the output domain. These restrictions impose consistency between the modeled label sequences and the results of morphosyntactic analysis, both at decoding time and, more importantly, in the parameter estimation process. We decompose the problem of morphosyntactic disambiguation into two consecutive stages: context-sensitive morphosyntactic guessing and disambiguation proper. This division helps in designing well-adjusted, CRF-based methods for both tasks, which in combination constitute Concraft, a highly accurate tagging system for the Polish language available under the 2-clause BSD license. Evaluation on the National Corpus of Polish shows that our solution significantly outperforms other state-of-the-art taggers for Polish (Pantera, WMBT and WCRFT), especially in terms of accuracy on unknown words.
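The idea of position-wise restrictions can be illustrated with a minimal sketch. It is not Concraft's implementation, which the abstract does not detail. It assumes a first-order linear-chain CRF, and the names `emit`, `trans` and `allowed` are hypothetical: `emit[i][y]` is the score of label `y` at position `i`, `trans[(a, b)]` is a transition score, and `allowed[i]` is the set of labels that a morphosyntactic analyser proposes for word `i`. Decoding searches only within the allowed sets, and the normaliser used in training is assumed to sum only over sequences that respect them.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def constrained_viterbi(emit, trans, allowed):
    """Best label sequence such that the label at position i is in allowed[i]."""
    best = {y: emit[0][y] for y in allowed[0]}
    back = []
    for i in range(1, len(emit)):
        new_best, pointers = {}, {}
        for y in allowed[i]:
            # Only labels permitted at position i-1 are keys of `best`.
            prev = max(best, key=lambda p: best[p] + trans[p, y])
            new_best[y] = best[prev] + trans[prev, y] + emit[i][y]
            pointers[y] = prev
        best = new_best
        back.append(pointers)
    last = max(best, key=best.get)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def constrained_log_partition(emit, trans, allowed):
    """Log of the sum of exp(score) over sequences that respect `allowed`.

    Assumption: training would use this restricted normaliser in place of
    the sum over all label sequences.
    """
    alpha = {y: emit[0][y] for y in allowed[0]}
    for i in range(1, len(emit)):
        alpha = {
            y: emit[i][y] + logsumexp([alpha[p] + trans[p, y] for p in alpha])
            for y in allowed[i]
        }
    return logsumexp(list(alpha.values()))


# Hypothetical toy input: three words, three labels.
labels = ["N", "V", "A"]
emit = [
    {"N": 1.0, "V": 2.0, "A": 0.0},
    {"N": 0.5, "V": 0.1, "A": 1.5},
    {"N": 0.2, "V": 0.3, "A": 0.1},
]
trans = {(a, b): (0.5 if a != b else -0.5) for a in labels for b in labels}
allowed = [{"N", "A"}, {"N", "V"}, {"V"}]

path = constrained_viterbi(emit, trans, allowed)
log_z = constrained_log_partition(emit, trans, allowed)
```

Because both routines iterate only over the allowed labels, the cost per position depends on the number of analyses proposed for a word rather than on the size of the full tagset, which is typically large for a highly inflected language.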
Similar resources
Arabic Morphosyntactic Raw Text Part of Speech Tagging System
Introduction and Overview: The topic of this dissertation is morphosyntactic part-of-speech tagging (abbreviated POS tagging) for Arabic. This topic has a long and rich history for other languages, mainly English. POS tagging provides fundamental information about the word forms used in natural-language sentences. The method of utilizing this information varies depending on the particular NLP ...
Identification of High-Frequency Morphosyntactic Structures in Persian-Speaking Children Aged 4-6 Years: A Qualitative Research
Background: Syntax has a high importance among linguistic parameters and the prevalence of syntax deficits is relatively high in children with language disorders. As such, independent examination of syntax in language development is of paramount importance. In this regard, Iranian language pathologists are faced with the lack of standardized tests. The present study aimed to determine the most ...
A Study of Inflectional Categories of Noun in Sistani Dialect
The present article aims to provide a synchronic study of the inflectional or morphosyntactic categories of the noun in the Sistani dialect. These categories comprise person, number, gender or noun class, definiteness, case, and possession. Linguistic data were collected by recording free speech and interviewing 30 (15 female, 15 male) illiterate Sistani language consultants of age 40–102 year...
Turkish PoS Tagging by Reducing Sparsity with Morpheme Tags in Small Datasets
Sparsity is one of the major problems in natural language processing. The problem becomes even more severe in agglutinating languages that are highly prone to be inflected. We deal with sparsity in Turkish by adopting morphological features for part-of-speech tagging. We learn inflectional and derivational morpheme tags in Turkish by using conditional random fields (CRF) and we employ the morph...
Language Sample Analysis of Children With Cleft Lip And Palate: A Comparative Study
Background: Cleft palate (CP) with or without cleft lip (CL/P) are the most common craniofacial birth defects. Cleft lip and palate (CLP) can affect children’s communication skills. The present study aimed to evaluate the language production skills, with regard to morphology and syntax (morphosyntax), of children with CLP. Method: In the current cross-sectional study, 58 Persian-language child...